When the WebsiteAgent receives Events, we do not need to require that they contain a url keyword

Andrew Cantino, 9 years ago
Parent
Commit
8d8d8d614a
2 changed files with 18 additions and 8 deletions
  1. app/models/agents/website_agent.rb (+7 -7)
  2. spec/models/agents/website_agent_spec.rb (+11 -1)

+ 7 - 7
app/models/agents/website_agent.rb

@@ -16,7 +16,7 @@ module Agents
     description <<-MD
       The Website Agent scrapes a website, XML document, or JSON feed and creates Events based on the results.
 
-      Specify a `url` and select a `mode` for when to create Events based on the scraped data, either `all` or `on_change`.
+      Specify a `url` and select a `mode` for when to create Events based on the scraped data, either `all`, `on_change`, or `merge` (if fetching based on an Event, see below).
 
       `url` can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape)
 
@@ -37,7 +37,7 @@ module Agents
 
       # Scraping HTML and XML
 
-      When parsing HTML or XML, these sub-hashes specify how each extraction should be done.  The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`.  It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into string.  Here's an example:
+      When parsing HTML or XML, these sub-hashes specify how each extraction should be done.  The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`.  It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into a string.  Here's an example:
 
           "extract": {
             "url": { "css": "#comic img", "value": "@src" },
@@ -45,11 +45,11 @@ module Agents
             "body_text": { "css": "div.main", "value": ".//text()" }
           }
 
-      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. To extract the innerHTML, use "./node()"; and to extract the outer HTML, use  ".".
+      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and `.//text()` extracts all the enclosed text. To extract the innerHTML, use `./node()`; and to extract the outer HTML, use  `.`.
 
-      You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc.  Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.
+      You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove commas from formatted numbers, etc.  Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.
 
-      Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions all namespaces are stripped from the document unless a toplevel option `use_namespaces` is set to true.
+      Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions, all namespaces are stripped from the document unless the top-level option `use_namespaces` is set to `true`.
 
       # Scraping JSON
 
@@ -92,7 +92,7 @@ module Agents
 
       Set `uniqueness_look_back` to limit the number of events checked for uniqueness (typically for performance).  This defaults to the larger of #{UNIQUENESS_LOOK_BACK} or #{UNIQUENESS_FACTOR}x the number of detected received results.
 
-      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header.  Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
+      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header.  Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
 
       Set `user_agent` to a custom User-Agent name if the website does not like the default value (`#{default_user_agent}`).
 
@@ -343,7 +343,7 @@ module Agents
               if url_template = options['url_from_event'].presence
                 interpolate_options(url_template)
               else
-                event.payload['url']
+                event.payload['url'].presence || interpolated['url']
               end
             check_urls(url_to_scrape, existing_payload)
           end
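
The lookup order introduced by this hunk can be sketched in plain Ruby. This is only an illustration, not the agent's actual code: `presence`, `resolve_url`, and `render_template` are hypothetical names, `presence` is a simplified stand-in for ActiveSupport's `Object#presence`, and a plain `options` hash is used where the agent reads `interpolated['url']`.

```ruby
# Hypothetical stand-in for ActiveSupport's Object#presence:
# returns nil for nil or whitespace-only strings, the value otherwise.
def presence(value)
  return nil if value.nil?
  return nil if value.respond_to?(:strip) && value.strip.empty?
  value
end

# Hypothetical helper mirroring the lookup order in the hunk above:
#   1. the `url_from_event` template, if one is configured
#   2. the `url` key of the incoming event payload, if non-blank
#   3. the agent's own `url` option
# `render_template` is a placeholder for the agent's interpolation step.
def resolve_url(options, payload, render_template: ->(template) { template })
  if (template = presence(options['url_from_event']))
    render_template.call(template)
  else
    presence(payload['url']) || options['url']
  end
end

options = { 'url' => 'http://xkcd.com' }

# No `url` in the payload: fall back to the agent option.
puts resolve_url(options, { 'link' => 'Random' })          # prints http://xkcd.com

# A non-blank payload `url` still takes precedence over the option.
puts resolve_url(options, { 'url' => 'http://example.com' }) # prints http://example.com
```

Before this change the `else` branch returned only the payload value, so an event without a `url` key left the agent with nothing to fetch.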

+ 11 - 1
spec/models/agents/website_agent_spec.rb

@@ -769,7 +769,7 @@ fire: hot
           @event.agent = agents(:bob_rain_notifier_agent)
           @event.payload = {
             'url' => 'http://xkcd.com',
-            'link' => 'Random',
+            'link' => 'Random'
           }
         end
 
@@ -826,6 +826,16 @@ fire: hot
           })
         end
 
+        it "should use the options url if no url is in the event payload, and `url_from_event` is not provided" do
+          @checker.options['mode'] = 'merge'
+          @event.payload.delete('url')
+          expect {
+            @checker.receive([@event])
+          }.to change { Event.count }.by(1)
+          expect(Event.last.payload['title']).to eq('Evolving')
+          expect(Event.last.payload['link']).to eq('Random')
+        end
+
         it "should interpolate values from incoming event payload and _response_" do
           @event.payload['title'] = 'XKCD'